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Abstract 

Restricted solid on solid surface growth models can be mapped onto binary 
lattice gases. We show that efficient simulation algorithms can be realized on 
GPUs either by CUDA or by OpenCL programming. We consider a deposi- 
tion/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimen- 
sions related to the Asymmetric Simple Exclusion Process and show that for 
sizes, that fit into the shared memory of GPUs one can achieve the maximum 
parallelization speedup x 100 for a Quadro FX 5800 graphics card with re- 
spect to a single CPU of 2.67 GHz). This permits us to study the effect of 
quenched columnar disorder, requiring extremely long simulation times. We 
compare the CUDA realization with an OpenCL implementation designed for 
processor clusters via MPI. A two-lane traffic model with randomized turning 
points is also realized and the dynamical behavior has been investigated. 



1. Introduction 

Understanding basic surface growth models is an important task of statis- 
tical physics [l|, . The exploration of universal scaling behavior and pattern 
formation is a recent challenge as fabrication of nanostructures via self-organized 
bottom-up technology is getting widespread [3]. Simulation methods are very 
useful tools, since the solution of coupled differential equations is very difficult 
in general [J]. Mapping of simple surface growth models onto binary lattice 
gases has proved to be a very useful tool to understand basic phenomena via 
nonequilibrium reaction-diffusion system |5|]. Recently it has been shown that 
such mapping can be extended to higher dimensions 0, II] and more complex 
reactions, like surface diffusion ... etc. can also be treated in this way 
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The research of nonequihbrium lattice gases even in 1 + 1 dimensions is of 
current interest (for a review see [l^l)- Some of them are exactly solvable, pro- 
viding a framework to extend thermodynamical description to nonequilibrium 



systems (for a review see Disorder, which is present in almost all real 



systems has been investigated in these systems and found to be a relevant in 



many cases [24|, [22 1 



By coupling one-dimensional lattice gases one can create the simplest mod- 



els of multi-lane traffic [12[, or describe the cooperative behavior of molecular 



motors 13[. These are proteins capable to carry different cargoes from one part 
of the cell to the other IJ] . Several versions of such multi-lane system have been 
investigated recently, we concentrate on the numerical simulation of a model, 
in which the motion of particles in the lanes is opposite and they can jump in 
between the lanes randomly (l5| . 

The computational complexity of the simulation algorithms is high enough 
to consider an implementation on modern graphics processing units (GPUs) 
since the performance of the GPUs is much higher than that of normal CPUs 
for the same cost and power supply. Furthermore our application has commu- 
nication/computing ratio, that fits GPUs better than a cluster of CPUs. To 
gain this performance the problem must fit to the architecture of GPUs, i.e. the 
following factors must be fulfilled: 

1. The problem can be divided into sub-problems which can be computed 
independently (parallelization) using parallel threads. 

2. The parallel threads of the algorithm do not need to communicate directly. 
Data exchange can be performed using the memory (SMP approach). 

3. The amount of memory needed for the computation inside each parallel 
thread is limited. 

The first of the three points is obviously essential to create a parallel al- 
gorithm. The second point follows from the architecture of GPUs, where no 
bus system exists between processing elements. The third point is necessary for 
the performance of the algorithm only. Many articles in the field of statistical 
physics have already shown that there is a wide variety of simulation algorithms, 
which can benefit from GPU implementations [3, 3 IS 21 \ ■ 



Today's GPUs are mainly produced by two manufacturers: NVIDIA and 



ATI. The architecture differs slightly [25|, |26[ , but the main concepts are the 



same: a GPU device consists of a number of multiprocessors (NVIDIA Fermi: 
14, ATI Hemlock: 40), each equipped with several processing elements (NVIDIA 
Fermi: 32, ATI Hemlock: 16), an instruction block, shared memory and caches 
(see Table [1] for a summary). Each processing element (PE), which is also 
called CUDA core (NVIDIA) or Stream core (ATI) is a simple entity, containing 
arithmetic logic units, floating point and integer units, flow control and registers. 
The performance of PEs is limited, but due to parallelism the peak performance 
of the entire GPU is much higher than that of a multi-core CPU. 

For the programmer a GPU is a device, which executes kernels in parallel as 
a set of threads. There are vendor specific language extensions like NVIDIA's 
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Table 1: Key facts of the GPU cards from NVIDIA [2^ and ATI [5^. 





NVIDIA CI 060 


NVIDIA C2070 


ATI Radeon 5970 




or Quadro FX 5800 


(Fermi) 


(Hemlock) 


Number of multiprocessors (mp) 


30 


14 


40 


Number of processing elements / mp 


8 


32 


16 


Clock rate 


1300 MHz 


1150 MHz 


725 MHz 


Global memory 


4 GB 


6 GB 


2 GB 


Shared memory per mp 


16 kB 


48 kBi 


32 kB 


Memory clock rate 


800 MHz 


1500 MHz 


1000 MHz 


Global memory bandwidth 


102 GB/s 


144 GB/s 


256 GB/s 


Peak GFlop/s (single precision) 


936 


1030 


4640 


Peak GFlop/s (double precision) 


78 


515 


928 



CUDA or platform independent ones like OpenCL [27| , which moreover can be 
used to program other types of processors (DSPs, CPUs). 

The speedup, which can be achieved using one GPU (compared to running 
the simulation on a single CPU-core) is already remarkably high, but the run- 
time can still be long enough to consider multi-CPU simulations. On many 
systems it is possible to plug in more than one GPU card, but if more than, 
for example, four cards are to be used it is necessary to parallelize the program 
over a network using methods like message passing (MPI). 

Some early results from a first GPU version have already been published 
before In this article we will discuss implementations on different CPUs 

(NVIDIA and ATI) using CUDA and OpenCL for single-CPU versions and MPI 
for the multi-CPU code. 

2. 1+1 dimensional Kardar-Parisi-Zhang growth 

Surface growth in statistical mechanics has been investigated by simple mod- 
els, atomistic ones or continuum theories [H. Understanding of several funda- 
mental questions is still not complete. The Kardar-Parisi-Zhang (KPZ) equa- 
tion (29| . motivated by experimentally observed kinetic roughening has been 
the subject of large number of theoretical studies 0, Is^. Later it was found 
to model other important physical phenomena such as randomly stirred fluid 
[3ll | , dissipative transpor t I32l . l33| , directed polymers [s^l and the magnetic flux 
lines in superconductors [35|. It is a non- linear stochastic differential equation, 
which describes the dynamics of growth processes in the thermodynamic limit 
specified by the height function /i(x,i) 

dth{x, t)=v + (TV2ft.(x, t) + A(Vft.(x, t)f + r;(x, t) . (1) 



^64 kB can be split into 48 kB of shared memory and 16 kB of LI cache or vice versa. 
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Here v and A are the amplitudes of the mean and local growth velocity, a is 
a smoothing surface tension coefficient and tj roughens the surface by a zero- 
average Gaussian noise field exhibiting the variance 



Mx,i>/(x',0) = 2i?'^'(x-x')(t-t') 



(2) 



Here d is used for the dimension of the surface, D for the noise amplitude and 
denotes average over the noise distribution. 

The solution of KPZ in 1 + 1 dimensions is known exactly for different 
initial conditions 




39| , but in higher dimensions only approximations 
are available (see |5|). In d > 1 spatial dimensions due to the competition 
of roughening and smoothing terms, models described by the KPZ equation 
exhibit a roughening phase transition between a weak-coupling regime (A < Ac), 
governed by A = Edwards- Wilkinson (EW) fixed point [45|, and a strong 
coupling phase. 

Mapping of surface growth onto reaction-diffusion system allows effective 
numerical simulations 0, In one dimension a discrete, restricted solid on 
solid H realization of the KPZ growth 4^, 461 is equivalent to the Asymmetric 
Simple Exclusion Process (ASEP) of particles [41[ (see Fig.[TJa)). In this discrete 




q 



(a) 



(b) 



Figure 1: (a) Mapping of the 1-1-1 dimensional surface growth onto the Id ASEP model. 
Surface attachment (with probability p) and detachment (with probability q) corresponds to 
anisotropic diffusion of particles (bullets) along the Id base space, (b) The ASEP model. 

so-called 'roof-top' or 'single-step' model the heights are quantized and the local 
derivatives can take the values Ah = ±1. If we associate the up slopes with 
'particles' and down slopes with 'holes' of the base lattice (see Fig. [IJb)), the 
adsorption/desorption corresponds to ASEP. This is a binary lattice gas [i^ , 
[4^, where particles can hop on adjacent sites, with asymmetric rates, in case 
of vacancy. The behavior of ASEP is rather well known, and variations of 
ASEP (disorder, interactions ... etc.) can be related to variations of 1 -f 1 
dimensional KPZ growth models. When left/right symmetry in the hopping 
rates holds, we have the so called Symmetric Simple Exclusion Process (SSEP), 
which corresponds to the EW growth model with the scaling exponents: z = 2, 
a = /3 = (i.e. logarithmic growth). 

Usually the numerical solution of partial differential equations depends on 
the discretization scheme, however we used a mapping here onto binary vari- 
ables, where such discretization is not feasible. All gradients are approximated 



The height differences Ah{x, t) between neighbouring sites is finite 
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by '1/ — 1' slopes and we exploited the spatial scale invariance of the system, 
which is due to the diverging correlation length of these models. For CUDA we 
used coalesced memory access, which is optimal for smaller system sizes that fit 
into the shared memory. 

2.1. Realization by CUDA 

We have programmed the one dimensional ASEP model by CUDA within 
the framework of FX 5800 (or Tesla C1060) GPU cards as follows^. To achieve 
maximum performance we limited the maximum system size to L = 16000, 
hence using bit-coding technique one can store a ring (periodic boundary con- 
dition) within the shared memories of multiprocessors (in the FX5800 graphics 
cards there are 16 Kbyte shared memory/block). Therefore one multiprocessor 
block, containing 8 PEs, follows the evolution of a single chain and does not 
interact with the other 30 multiprocessors of the graphics card. Thus we have 
fast inter-PE communication during the simulations, and we need access of the 
slower card memory only at the beginning and at the end of a time loop. 

Each multiprocessor block performs a statistically independent run, with 
completely different random number sequence. This is achieved by the intelli- 
gent striding method of gpu-rng, the pseudo-random number generator of GNU 
developed for CUDA. This is in fact the parallel version of drand48, a standard 
linear congruential generator with 48-bit integer artihmctics [23]. This genera- 
tor has sufficiently long cycle (>'^ 2^^^^^) and the main program pre-shifts the 
seeds of the individual gpu kernel parts such that they do not overlap with each 
other. 

Starting from periodic, alternating "0/1" distribution, which corresponds to 
a flat initial surface, we update the particle model by the rules defined in the 
previous section. The difference between the standard ASEP and the CUDA 
realization is that we use parallel, two-sub-lattice updates instead of the random 
sequential one, which fits serial computers. To allow simultaneous, multiple PE 
updates without particle collisions we divide the ASEP chain (with binary state 
variables {si} G (0, 1) ,i — Q,...L — 1) into even and odd blocks (see Figl2]). 
Since there are less processors than sites, a single sweep of the lattice update is 
done via the sequence shown on Fig.O First we update the even sub-lattice with 
forward or backward jumps of 'I'-s with probabilities pi or qi+i (i = 0, ...L — 2) 
respectively. These are distributed among independent threads of the kernel 
function. The best performance was obtained when the number of threads was 
256 (for FX 5800). Then we update the odd sub-lattice sites {i = 1, ...L - 3) 
similarly. Finally we close the ring by jumping 'I'-s between the first and the 
last sites. Naturally in the totally anisotropic ASEP (TASEP) case the qi steps 
are omitted, and if there is no disorder Pi = p for all sites. When all sites are 
updated once the time {t) is incremented by one Monte Carlo step (MCs) and 



•^For other NVIDIA cards the algorithm is the same, one has to modify the hardware 
dependent quantities. For example in case of Fermi card one has larger shared memory, hence 
one can store larger systems. 
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6. V'k^ VIl^ 



Figure 2: Internal operation sequence of threads (numbered on the left) of a single lattice sweep 
of the parallel, two-sub-lattice update of ASEP algorithm, in case of independent jumping 
rates. 



we repeat these sweeps until t = tmax- For the stochastic cellular automaton 
(SCA) updates we need random numbers generated locally, by each exchange 
using gpu-rng in each thread. 

The equivalence of ASEP with SCA dynamics and random sequential up- 
dates (RSU) is not evident, therefore first we compared our results with the 
known results for RSU. At certain time steps we calculate the hx{t) heights 
from the height differences A/i = 2{si — 1) G (1,-1). The morphology of a 
growing surface is usually characterized by its width 

W{L,t) = E -ikil ■ (3) 

x—l x—1 

In the absence of any characteristic length, growth processes are expected to 
show power-law behavior of the correlation functions in space and height and 
the surface is described by the Family- Vicsek scaling [3] 

W[L,t) cx t*^, for to«t«ts (4) 
oc L", for t » ts . (5) 

Here a is the roughness exponent and characterizes the deviation from a flat 
surface in the stationary regime (i >> <s), in which the correlation length (S) has 
exceeded the linear system size L; and /3 is the surface growth exponent, which 
describes the time evolution for earlier (non-microscopic t » to) times. The 
dynamical exponent z, describing the space-time anisotropy, can be expressed 
by the ratio 

z^a/P . (6) 

The maximum simulation time (tmax) of runs depends on the time needed 
to achieve the steady state (saturation), hence on the system size L. We have 
checked this by inspecting the results on the log. -linear scale. Due to the par- 
allelization among multiprocessor blocks we obtain 30 independent realizations 
of the W{L, t) data at once and there is an external statistical loop in the CPU 
code to multiply this further. At the end of the program the W'^{L, t) data of 
each sample are averaged out by the CPU. 



We have also tested the effect of the p value. In the RSU case one can find 
the best KPZ scaling for p — 1, but here we have to choose p < 1, in order to 
introduce some randomness. Otherwise particles would just flip back and forth. 
A larger p helps to reach the saturation earlier, but compresses the dynamical 
scaling region. We have compared simulations with p = 0.5 and p = 0.8 and 
found the same scaling exponents. For the time being we fixed p = 0.8 as a 
compromise between speed and dynamical scaling region size. 

We have represented the state of an ASEP site (si) by the first bit of a char- 
acter variable (Ci), while we used other bits of Ci to store other, site dependent 
information. As we will see later in case of multi-lane ASEP, with quenched 
disorder this allows an effective CUDA algorithm and memory allocation. 

In case of extremely long simulation times tmax > 2"^^, which is necessary 
to achieve the steady state of larger systems we allowed to pass first and last 
elements of Si between the multiprocessor blocks and close the chain by the 
ends of the PEs of a GPU card. In this case we simulate a single system by 
one card. The communication loss, which is due to reading and writing of these 
characters to the device memory has been found to be small due to the fact 
that such events occur only once by every lattice sweep, i.e. with the frequency 
l/(L/30). 

2.2. Results for KPZ 

We have run dynamical simulations of the ASEP model with parallel SCA 
updates on rings of L = 2^, 2^^°, ...2^"* sizes 0. Each run was started from the 
alternating sequence of 'O'-s and 'I'-s, corresponding to the most flat surface. 
The probabilities of forward/backward jumps of 'I'-s was p — 0.8 and q = 
respectively. This so-called Totally Asymmetric Simple Exclusion Process 
(TASEP) is known to exhibit KPZ scaling in the thermodynamic limit. We 
sampled the evolution by exponentially increasing time gaps: 

t,_^i = * 1.05 if i^O, to ==30 (7) 

and calculated W'^{ti,L) at these times. For the longest run, by L = 16000 
the saturation was reached at ^ 10^ MCs and we followed the simulations up 
to 10^ in the steady state. This means 260 data sampling events, according to 
([7]), which requires negligible time compared to the length of the calculations 

4 minutes). This ensures that the serial sampling part does not alter the 
efficiency of the parallel CUDA code. 

By the scaling analysis we have subtracted the leading order correction to 
scaling from the width data, i.e the constant value W^{0,L) = 0.25, corre- 
sponding to the intrinsic width of the initial zig-zag state. As one can see on 
Fig. El excellent data collapse could be achieved for different sizes with the 1-1-1 
dimensional KPZ exponents: a = 1/2 and z = 3/2 



^ Other lattice sizes, which are not powers of 2 can also be simulated, but with a lower 
efficiency 
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Figure 3: Scaling collapse of the 1 + 1 dimensional parallel update KPZ (ASEP) by CUDA 
simulation results for sizes L = 512, IK, ...IdK. 



The parallel CUDA code run on Quadro FX 5800 card @1.3 GHz was com- 
pared with the same algorithm run on a CPU core (Intel 15 750 @2.67 GHz) 
and the speedup: x 100 was observed. This is near to the physical limit, 
taking into account that the graphics card has 240 PE-s with twice slower clock 
rate. This is the consequence of independent threads, i. e. the run-time data 
are packed in the fast shared or constant memories. The time needed to run 30 
realizations of a L = 1024 sized system up to tmax — 10^ MCs on a Quadro FX 
5800 card is only 4.2 seconds. 

2.3. Results for KPZ with quenched disorder 

Having confirmed the validity of KPZ scaling for the SCA CUDA algorithm 
we introduced quenched, site-wise disorder in the hopping rates, with the bi- 
modal distribution: 

P{p^) = {l~Q)5{p^-p) + Q5{p,-rp) , (8) 

i.e. we reduced the probability of exchange by a factor r at random sites selected 
with probability Q = 0.5 (here 5{x) denotes the Kronecker delta function). 
When we allow backward jumps as well we use the same distribution for P{qi). 
The sites with reduced hopping rates were tagged by the second/ (third) bits of 
the Ci variable corresponding to site i. 

The TASEP particle model with quenched, site-wise disorder (Q-TASEP) 
corresponds to KPZ growth with columnar disorder (QCKPZ): 

at/i(x, t) = v + aV^hix, t) + A(V/i(x, t)f + 77(x) . (9) 
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i.e. the noise field ri{x) is constant in time and independent of height. The 
Q-TASEP model was investigated by scaling arguments in [23| and emphasized 
that "numerical confirmations for the coarsening dynamics in case of site-wise 
disorder would be most welcome" . We have done this by Monte Carlo simula- 
tions of the parallel update SCA version of Q-TASEP. 

As Fig. m^a) shows we followed the evolution of W{t, L) beyond saturation 
for sizes: L = 2^*^, 2^^, 2^^, 2^^. The number realizations, with independent 
disorder distribution was 120 for each size. The speedup compared to the serial 
code on the reference CPU was about x45, depending on the parameters slightly. 
On Fig. m^b) we can see the same data rescaled by the exponents a — I, z — 1 
according to Eqs. Q and ([5]). The data collapse is reasonable if we take granted 
logarithmic corrections to scaling and we can confirm the results of ^24] indeed. 



More detailed analysis will be published elsewhere [47 1. 

To realize SSEP with disorder (Q-SSEP) it is important to have both forward 
and backward jumps of 'I'-s, otherwise the left-right symmetry is broken and 
we find normal scaling instead of an 'activated' one. As Fig. [5ja) shows an 
extremely slow convergence to saturation emerges for r — 0.2. The evolution of 
W{t,L) cannot be described by (HJ, but it follows the activated scaling law 

W{t,L) (x\n{tf (10) 

as shown on Fig. [DJb). This behavior is expected, when the so-called strong 
disorder fixed point dominates in the renormalization group sense (2^ . In this 
case the dynamics slows down and the relaxation time is characterized by 

ln(r) (X ^'l' . (11) 

Finite size scaling results in a good data collapse with the exponents a — 1, 
tp = 1/3 and 13 = 2.85(30). Further results of the data analysis wiU be discussed 
elsewhere (47i |. 
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Figure 5: (a) Surface width of (Q-SSEP) by CUDA simulations for sizes L = 
512,1024,2048,4096 (bottom to top), (b) The same, rescaled with the exponents o = 1, 
5/1 = 1/3. The dashed hne shows a fitting to L = 4096. 



24. Multi-GPU application by MPI and OpenCL 

Our intention was to find out whether different GPU architectures have sig- 
nificant impact on the run-time of simulations and to create a multi-card pro- 
gram. To achieve this we favored OpenCL over other vendor-dependent APIs. 
OpenCL is a high-level language, compared to NVIDIA's CUDA or AMD's 
CAL, which are optimal when one develops a program for a specific GPU fla- 
vor. However creating two implementations of the same algorithm, tuning both 
of them with vendor specific tricks is very time consuming, therefore we have 
chosen OpenCL and wrote an algorithm that relies solely on the general aspects 
of GPU programming within the framework of programs that execute across 
heterogeneous platforms consisting of CPUs, CPUs, and other processors 27 1 
and let the compilers to do the optimizations. 

The algorithm used in this program is similar to that of the previously 
explained CUDA code, however the implementation differs at many points. It 
is similar, since it uses bit level coding of the chain, however the state of lattices 
sites and the reduced probability locations are stored separately. They are not 
inside the same array of characters, but each of them is stored in an array of its 
own (see Fig. [6]) . This saves space from the shared memory in case of ASEP, 
allowing one to simulate longer chains, if needed. Furthermore, it has a second 
collateral benefit, which ultimately abolishes the need to use shared memory, 
lifting the limit imposed by the size of the shared memory of a Compute Unit. 
This provides a limitation on L only by the total VRAM size accessible to a 
GPU. 

As we saw earlier, updating lattice sites is completely independent at odd 
and even time-steps, hence we can organize our data, that allows processing the 
chain by vector operations. This optimization was used to utilize AMD GPU 
capabilities, namely the advantage coming from the fact that Stream Cores 
located in AMD CPUs are 4 unit wide vector-processors (with a fifth Special 
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Figure 6: Memory map of data in the OpenCL algorithm. 



Function Unit, inaccessible directly). This will not hinder NVIDIA cards, they 
have to deal with vector and matrix operations, unrolled to scalar ones either 
by the driver (in case of DirectX applications) , or by the compiler (in the case 
of OpenCL applications). 

The second change in the implementation is related to another optimization. 
It is a general GPGPU practice, that one tries to avoid flow control whenever 
possible. When there is no actual difference of the program flow in two branches 
of a control statement, besides the value of a single variable predication should 
be used. One should keep in mind that predication does not save computation 
required for the evaluation. It only removes the need of branching, which pro- 
gram flow to enter, hence it decreases the idle ALU time. This might seems to 
be an over-optimization, but practice shows that flow control throws back the 
performance and should be avoided whenever possible. 

Although less important in long simulations, if the sampling is done at rare 
time steps ([7]) , the output (t) is computed in a parallel manner on the GPU 
device. It is possible to add second moments of the height distribution with the 
proper correction, explained in ^48i] . The method implemented is a mixture of 
the two-pass algorithm and the pairwise method. This is useful if multiple, long 
chains are calculated, when it takes longer to collect data. 

As it has been stated before the limitation imposed by the size of the shared 
memory can be lifted if we avoid its usage. This is both a benefit and a drawback 
of high-level implementations, like OpenCL, that the explicit program behav- 
ior is masked from the programmer. Early implementations showed, that the 
program ran well, when more shared memory was allocated, than what was 
available to the card. This is due to the fact that even if excess data are stored 
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in the VRAM, it is still within the shared memory name-space and the compiler 
does not notify the programmer about the excessive memory usage. Bench- 
marks with this implementation showed that run-time did not increase, when 
some of the data was read from device memory. 

The two diagrams on Fig. [7] help us to understand why reorganizing data 
removes the advantage of shared memory usage. The diagrams show how a 
PE hides read latencies by thread scheduling inside a work-group. Dark red 
parts show useful work, beige parts show idle threads waiting for execution and 
crossed parts show idle thread time due to a memory read. When a thread 
requires data outside of its registers, the execution stalls until data arrives. If 
a stall occurs, the execution is given to another thread in the work-group, until 
it reaches a read command. If data are fetched in too small portions compared 
to the amount of time required to process it, scheduling won't be able to hide 
read latencies, the useful (red) work time will be too short to make up for 
the time needed to read. If lattice states are stored in vectors of integers, one 
vector read operation requires 128 lattice points to be updated (see Fig. [7] (a)). 
In comparison, if every lattice point is stored in a separate character, a read 
command is imposed in between lattice updates (see Fig. [TKb)). 




ffl 



= executng ^ ) = ready {not executing) 0()()\} = stallei 



= executing { ) = ready {not executing) (XXX} = stalled 



(a) 



(b) 



Figure 7: Latency hiding among threads (T0,T1,T2,T3) for vectorized (a) and non-vectorized 
execution 



The main reason for the OpenCL implementation was to investigate if this 
kind of problem fits clusters of GPUs. Solving the problem on a cluster can be 
approached in multiple ways. Using multiple GPUs one can simply increase the 
number of chains simulated in parallel to reduce statistical errors. Multi-cards 
can also be used to simulate larger chains, spanning over more than one GPUs. 

Implementing the latter idea would be an effort in vain for the following rea- 
sons. The algorithm should copy (a few bits) at every lattice sweeps. In order to 
optimize the work-time/data- fetch ratio by the vectorization, the ratio of inter- 
lattice communication and lattice sweep time is fixed. Even inside shader codes, 
the most expensive operations are synchronization (sync) commands. One sync 
command takes roughly 2 magnitudes higher execution time, than regular ADD 
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and MUL commands. Sync commands outside a GPU, but inside the computer 
are yet again roughly 2 magnitudes more costly, while synchronizing over a net- 
work takes even more time. When one sweep of a lattice points can be done 
very quickly, the efficiency of the implemented algorithm would depend on the 
amount of sync and copy commands. 

For this reason the OpenCL implementation avoids shared memory usage 
and one can consider large chains (typically L ~ O(IO^)). The multiple card 
processing capacity is exploited just to increase statistics, which is very useful, 
because a large number of independent disorder realization is necessary in these 
problems. On the other hand in higher dimensions we will need more memory 
than what is accessible by a single GPU, so the other approach must be followed. 

OpenCL does not have such a long history as CUDA does, compilers are still 
evolving and not all functionality of the standard are supported by them. This 
leads to slight modifications in the code to make it cross- vendor ready. Further- 
more, different architectural capabilities require to create an implementation, 
which uses the 'greatest common denominator' of capabilities, so to say. 

Note, that the CUDA and OpenCL implementations use different random 
number generators. The CUDA code uses gpu-rng, while the OpenCL utilizes 
a Mersenne- Twister type of generator, slightly modified to fit to the problem. 
The difference in the performances due to these generators were not inspected. 

We compared the codes by simulating TASEP chains without disorder. The 
performance of the OpenCL implementation on NVIDIA cards is inferior to that 
of the CUDA implementation, (at least for L < 8K) due to reasons encountered 
previously. However, higher dimensional simulations will most likely have to 
use the OpenCL like data structure, due to the very limited size of the shared 
memory. 




^ , , , , , , , , 1 I . , , , , , , , ^ 

512 IK 2K 4K 8K 1 6K 32K 64K 128K 512 IK 2K 4K 8K 1 6K 32K 64K 128K 

L L 



(a) (b) 

Figure 8: Run-time comparisons for different implementations (CUDA, OpenCL) of TASEP 
on different platforms (single ATI GPU, single NVIDIA GPU, single cluster node containing 
16 CPU cores) (a) and for the OpenCL implementation using 1, 2, 4 and 8 NVIDIA C1060 
(b). 

The performance scaling of the algorithm is nearly optimal, as one can expect 
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it, taking into account that different threads run independently from each other. 
Intimacies between the MPI and GPU drivers cause that ideal scaling cannot 
be achieved. The OpenCL implementation was tested on NVIDIA C1060 and 
C2070 cards, on ATI Hemlock cards and also on conventional cluster nodes with 
4 AMD Opteron quad-core CPUs. The CUDA implementation was tested on 
the NVIDIA cards. 

The common task to be done was running the simulations up to t^ax = 10^ 
MCs for 560 realizations. Although the the architecture of Tesla C1060 GPU 
is considerably different from that of TESLA C2070 we could use the same 
algorithms. OpenCL is responsible for rearranging memory stores/loads and 
ALU instructions to fit the architecture best. Differences between C1060/C2070, 
such as the L2 cache remain unexposed to the user, are not accessible directly. 
The compiler and the HW decide how to use them. In case of CUDA we used 
different thread numbers for C1060/C2070 to achieve the best performance for 
a each. The algorithms could be optimized further, if the development time was 
unlimited, but our code is far from being a first-approach, exploits both basic 
and nontrivial aspects of GPU programming (2^, . 

As one can see on Fig. |8lja) the run-time (T(seconds)) initially does not 
change in the case of OpenCL ran on CPUs as L is increased. This is because 
GPUs are not fully utilized on short chains and CUDA cores are idle due to 
the lack of work. Run-times begin to increase as all PE-s start to work. The 
constant time region is longer for the Fermi based Teslas, due to the higher 
number of CUDA cores inside a Compute Unit. The OpenCL version, running 
on 16 CPU cores does not have the constant part for smaller chain lengths, but 
the scaling is quite similar to that of the GPUs. 

The CUDA implementation on the Fermi cards runs significantly faster, 
especially for smaller chain lengths. The run-times grow almost linearly (T oc 
g^j]^ sizes: 512 < L < 32K, that can be simulated due to the shared 
memory limitation. 

Multi-CPU run-time comparison was done on a limited number of NVIDIA 
GPUs, since only 4 C2070 and 8 C1060 NVIDIA cards have been available for 
us. Fig.[5Jb) shows the scaling on the 8 Teslas (C1060); while Fig.inija) displays 
the same for the 4 Fermi cards (C2070). As one can see these are nearly optimal, 
the curves go parallel for all tested chain lengths and the speedup grows almost 
linearly with the number of cards. This is far from trivial in case of parallel 
algorithms running on multiple cards. The OpenCL frame will be very useful 
for further applications of this type. In Fig. |9l[b) one can see that the efficiency, 
defined as the speedup per node, of the multi-GPU implementations depends 
on the number of cards, as well as on L in a non-monotonic way. This enables 
one to optimize simulations with limited resources. 

3. Two-lane ASEP model simulations 

Next we turn to the numerical study of a two-lane model, with opposite, 
unidirectional motion in the lanes and with quenched, random inter-lane cross- 
ing probabilities [3l ■ This arrangement can be thought of a simplified model of 
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(a) (b) 

Figure 9: (a) Run-time comparisons for the OpenCL implementation using 1, 2 and 4 NVIDIA 
C2020. (b) Efficiency on 2, 4 and 8 NVIDIA C1060 and 2 and 4 NVIDIA C2070. The peak 
for L = IQK on NVIDIA C2070 comes from a small delay on the server in the referenced 
1-GPU simulation. 



cars or motors moving along two oppositely oriented roads or filament tracks. 
The motion of a single particle in this environment has been found to exhibit 
enhanced diffusion coefficient compared to that of the symmetric random walk 
[l6l | . This type of active diffusion can be realized experimentally and also present 
in vivo [l7|. Theoretical studies have concluded that, the steady-state behavior 



is richer than that of the one-lane ASEP, furthermore interesting phenomena 
take place in such systems, like synchronization or localization of density shocks. 

The bidirectional, two-lane model is built up from two TASEP rings, in which 
particles move forward in both lanes deterministically, to the right in A and to 
the left in B (see Fig.fTO]). and random lane-crossing probabilities: Ui, Vi. These 
are quenched, random variables with bimodal distribution (jSj. We have been 
interested in the dynamical behavior of this model by determining the exponent 
z via finite size scaling. In the steady-state the total current of particles {c{t — 
00, L) vanishes due to the left/right symmetry and its fluctuation < c{t, L)^ > 
is expected to grow linearly in time [16']. 



A 



\\ r' 



1 

... i-2 i-1 i i-i-1 i+2 ... 
Figure 10: The bidirectional, two-lane model defined in ref [l5 
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First we realized and tested the SCA version of this model on a CPU, second 
we implemented it by a CUDA program. In order to convert the originally 
random sequential updates to SCA, as in case of the ASEP, we decreased the 
in-lane hopping probabilities top — 0.5. One sweep (MCs) of the system consists 
of the parts executed in the following sequence: 

1. Even sub-lattice updates of the lanes A and B, with probabilities p = 0.5, 

2. A B lane-changes, with probabilities Ui, 

3. Odd sub-lattice updates of the lanes A and B, with probabilities p = 0.5, 

4. Closing boundaries of A and B, with probabilities p — 0.5, 

5. -B — > A lane-changes, with probabilities Vi. 

All local data are packed in the shared memories, in the vector of characters 
Ci (see Fig. [TT|) . The first bits of this vector contain lane- A states, the second 
bits lane-i?, the third and fourth bits {ui,Vi) mark the sites, where the crossing 
probability is reduced from 0.8 to 0.2. These are filled at the at the beginning 



c c c c c 

^2 ^3 ^4 ^5 ^6 



i=l, ...L(max 16K) 



Figure 11: Representation of the 2-lane ASEP model by a vector of characters Ci in the shared 
memory. 



of time loops by 'O'-s and 'I'-s randomly, with probability 1/2. The updates are 
done by bit-masking and using standard, bit-wise (AND,XOR,...) operations in 
parallel and independently by the threads. 

We increased/decreased a variable ( J ± 1), measuring the current, by each 
hopping event along the lanes A and B respectively. The simulations were 
started from half filled, random initial conditions of system with linear sizes: 
L = 2^, 2-^°, 2^^, 2^^. Sampling of the current density: c — {J / L)"^ was done 
in the same way as in case of ASEP simulations JT]). At the end of a time loop 
we calculated the total current density fluctuations of the system < c(t, L)"^ > 
(the mean value is zero due to the left/right symmetry). Averaging was done 
for ~ 1000 samples for each size. 

As we can see on Fig. [12] (a), following an initial transient time the shape of 
each curve changes. The dynamical behavior crosses over to a different regime. 
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(a) (b) 

Figure 12: (a) Fluctuation of the total current in the two-lane model for sizes L = 
512,1024,2048,4096 (top to bottom), (b) Finite size scaling and collapse with the same 
data. 



the steady state, where we can find hnear growth due to the diffusion Uke 
behavior of particles. Indeed if we divide < c(i, LY > by the time, a level- 
off to constant value can be observed. To determine the dynamical exponents 
we performed finite size scaling (see Fig. [12] (b)). The best collapse could be 
achieved by assuming the dynamical exponent z = 2, but stron g co rrections can 
also be observed. A more detailed analysis will be provided in [47[. 

The speed-up of the code on the FX5800 card compared to the reference CPU 
was about x30. If we allow particles to travel among multiprocessor blocks (via 
the VRAM) we can simulate much larger sizes, i. e. one samplc/GPU. In this 
case we could reach a speed-up x20. 



4. Conclusions 

We have realized efficient Id lattice-gas simulation algorithms for CPUs for 
the first time. This allows us to investigate 1 -I- Id surface growth or transport 
phenomena with extremely long relaxation times. This is typical in case of 
system with quenched random reaction rates. Furthermore in these system 
averaging must be carried out over a large number of disorder realizations, 
thus this problem fits the advantages of parallel processors. We have confirmed 
that the SCA version of ASEP is equivalent to the random sequential one, 
hence it can be well adapted for simulations with massively parallel GPU cores. 
We used two-sub-lattice updates, which can be generalized to checker-board 
lattice decomposition in two dimensions. Our preliminary results for ASEP with 
quenched disorder are in agreement with some previous expectations, detailed 
discussion is in preparation jJzj. 

By coupling two ASEP chains with lane crossing probabilities we have made 
the first step towards two dimensional simulations [7|, |28| , as well as we found 



interesting results for a traffic problem. We used bit coding to allow efficient 
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memory management, which is critical if the fast shared memory of GPUs to 
be used. 

Two similar code realizations have been compered in detail on different GPU 
hardwares. The CUDA code is faster for smaller system sizes, than the OpenCL 
one, but has limitations due to the size of the shared memory and the architec- 
ture. In particular for the simple ASEP problem we could achieve a speedup of 
~ 100 as compared to a traditional, present day single CPU. This is near the 
physical limits of the GPU card. The muhi-GPU (MPI) OpenCL code has been 
proved to be portable not only among GPUs of different vendors, but runs on 
clusters of CPUs as well. 

We have provided detailed scaling analysis of these codes and pointed out 
implementation specific questions. These results warrant for further interesting 
applications of GPUs in statistical physics and materials science [1, . 

5. Acknowledgments 

We thank R. Juhasz, 1. Borsos, K.-H. Heinig, J. Kelling and N. Schmeisser for 
the useful discussions. Support from the Hungarian research fund OTKA (Grant 
No. T77629), and the bilateral German-Hungarian exchange program DAAD- 
MOB (Grant Nos. 50450744, P-M6b/854) is acknowledged. The authors thank 
NVIDIA for supporting the project with high-performance graphics cards within 
the framework of Professor Partnership. 

References 

[1] A. L. Barabasi and H. E. Stanley, Fractal Concepts in Surface Growth (Cam- 
bridge University Press, Cambridge, 1995). 

[2] J. Krug, Adv. Phys. 46, 139 (1997). 

[3] S. Facsko, T. Dekorsy, C. Koerdt, C. Trappe, H. Kurz, A. Vogt, and H. L. 
Hartnagel, Science 285, 1551 (1999). 

[4] M. Castro, A. H.-Machado, and R. Cuerno, Phys. Rev. E 79, 021601 (2009). 

[5] G. Odor, Universality in Nonequilibrium Lattice Systems, World Scientific, 
2008; Rev. Mod. Phys. 76, 663 (2004). 

[6] H. Hinrichsen and G. Odor, Phys. Rev. Lett. 82 (1999) 1205. 

[7] G. Odor, B. Liedke, and K.-H. Heinig, Phys. Rev. E 79, 021125 (2009). 

[8] G. Odor, B. Liedke, and K.-H. Heinig, Phys. Rev. E 81, 031112 (2010). 

[9] G. Odor, B. Liedke, and K.-H. Heinig, Phys. Rev. E 81, 051114 (2010). 

[10] Nonequilibrium Statistical Maechanics in One Dimension, edited by V. 
Privman, Cambridge Univ. Press. 1997. 



18 



[11] R. J. Harris, G. M. Schdtz, J. Stat. Mech. (2007) P07020. 

[12] D. Chowdhury, L. Santen and A. Schadschneider A, Phys. Rep. 329, (2000) 
199. 

[13] D. Chowdhury, A. Schadschneider and K. Nishinari K, Physics of Life 
Reviews (Elsevier, New York), (2005) vol. 2, p. 318. 

[14] M. Schhwa and G. Woehlke, Nature 422, (2003) 759. 

[15] R. Juhasz, J. Stat. Mech. (2010) P03010. 

[16] S. Klumpp and R. Lipowsky, Phys. Rev. Lett. 95, (2005) 268102. 

[17] S. P. Gross, Phys. BioL 1, (2004) Rl. 

[18] T. Preis at al, J. of Comp. Phys. 228 (2009) 44684477 

[19] M. Weigel, Simulating spin models on GPU, Proc. of CCP2010, (2010), 
larXiv: 100638651 

[20] M. Weigel, Performance potential for simulating spin models on GPU, 
preprint ar Xiv:1101.142 7, submitted to J. Comput Phys. 

[21] M. Bernaschi, G. Parisi, L. Parisi, arXiv:1006. 25661 ^1 

[22] R. Juhasz, L. Santen, and F. Igloi, Phys. Rev. Lett. 94, 010601 (2005). 

[23] J. A. van Meel et al. Molecular Simulation, 34, 259 (2008) 

[24] J. Krug, Braz. J. of Phys. 30, 97-104 (2000). 

[25] NVIDIA Corporation, Whitepaper: NVIDIA's Next Generation CUBA 
Gompute Architecture: Fermi, version 1.1 (2010). 

[26] Advanced Micro Devices, Inc., Programming Guide: ATI Stream Comput- 
ing OpenCL, version 2.2 (2010). 

[27] Khronos Group, http : //www . khronos . org/ opencl| 

[28] H. Schulz, G. Odor, J. Kelling, K.-H. Heinig, B. Liedke, N. Schmeisser, 
Computing the KPZ Equation Using GPU Acceleration, 3rd International 
Workshop Innovation in Information Technologies - Theory and Practice, 
Dresden (2010). 

[29] M. Kardar, G. Parisi, and Y. Zhang, Phys. Rev. Lett. 56, 889 (1986). 

[30] T. Halpin-Healy and Y.-C. Zhang, Phys. Rep. 254, 215 (1995). 

[31] D. Forster, D. R. Nelson, and M. J. Stephen, Phys. Rev. A 16, 732 (1977). 

[32] H. van Beijeren, R. Kutner, and H. Spohn, Phys. Rev. Lett. 54, 2026 
(1985). 



19 



[33] H. K. Janssen and B. Schmittmann, Z. Phys. B 63, 517 (1986). 

[34] M. Kardar, Nucl. Phys. B 290, 582 (1987). 

[35] T. Hwa, Phys. Rev. Lett. 69, 1552 (1992). 

[36] M. Prahofer and H. Spohn, Phys. Rev. Lett. 84 (2010) 4882. 

[37] T. Sasamoto T, J. Phys. A: Math. Gen. 38 (2010) L54956. 

[38] T. Sasamoto and H. Spohn, Phys. Rev. Lett. 104 (2010), 230602. 

[39] G. Amh, L Corwin, and J. Quastel, I arXiv: 1003.0443 , to appear m Comm. 
Pure Appl. Math. (2011). 

[40] P. Meakin, P. Ramanlal, L. Sander, and R. BaU, Phys. Rev. A 34, 5091 
(1986). 

[41] T. Ligget, Interacting particle systems^ (Springer- Verlag, Berlin, 1985) 

[42] S. Katz, J. L. Lebowitz and H. Spohn, J. Stat. Phys. 34, (1984) 497. 

[43] B. Schittmann and R. K. P. Zia, Statistical Mechanics of Driven Diffusive 
Systems, (New York, Academic, 1995) 

[44] F. Family and T. Vicsek, J. Phys. A 18, L75 (1985). 

[45] S. F. Edwards and D. R. Wilkinson, Proc. R. Soc. 381, 17 (1982). 

[46] M. Plischke, Z. Racz, D. Liu, PRB 35, 3485 (1987). 

[47] G. Odor, R. Juhasz, in preparation. 

[48] T. F. Chan, G. H. Golub, R. J. LeVeque, Updating formulae and a pairwise 
algorithm for computing sample variances, STAN-CS-79-773, (1979). 

[49] M. Barma, M. D. Grynberg and R. S. Stinchcombe, J. Phys.: Cond. Matt. 
19 (2007) 065112. 



20 



